Constructing and Using Broad-coverage Lexical Resource for Enhancing Morphological Analysis of Arabic
Authors
Abstract
Broad-coverage language resources that provide prior linguistic knowledge should improve the accuracy and performance of NLP applications. We are constructing a broad-coverage lexical resource to improve the accuracy of morphological analyzers and part-of-speech taggers for Arabic text. Over the past 1200 years, many different kinds of Arabic lexicons have been constructed; these lexicons differ in ordering, size, and purpose. We collected 23 machine-readable lexicons that are freely available on the web and combined them into one large broad-coverage lexical resource by extracting information from their disparate formats and merging the traditional Arabic lexicons. To evaluate the resource, we computed its coverage over the Qur’an, the Corpus of Contemporary Arabic, and a sample from the Arabic Web Corpus, using two methods. The first counts exact word matches between the test corpora and the lexicon, and scored about 65-68%; Arabic has a rich morphology with many combinations of roots, affixes and clitics, so about a third of the words in the corpora had no exact match in the lexicon. The second computes coverage as seen by a lemmatizer program, which strips clitics to look for a match for the underlying lexeme; this scored about 82-85%.
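The two coverage measures described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' lemmatizer: the clitic lists, the function names, and the one-proclitic-plus-one-enclitic stripping rule are all assumptions chosen for illustration, and a real system would need far more careful morphological handling.

```python
from typing import Iterable, Set

# Hypothetical, deliberately incomplete clitic lists (not taken from the paper).
PROCLITICS = ("وال", "بال", "ال", "و", "ف", "ب", "ل", "ك")
ENCLITICS = ("هما", "هم", "ها", "كم", "نا", "ه", "ك", "ي")


def candidates(word: str) -> Set[str]:
    """Return the word plus forms with one proclitic and/or one enclitic removed."""
    stems = {word}
    # Strip at most one proclitic, keeping at least two characters.
    for p in PROCLITICS:
        if word.startswith(p) and len(word) > len(p) + 1:
            stems.add(word[len(p):])
    # Strip at most one enclitic from every form collected so far.
    for s in list(stems):
        for e in ENCLITICS:
            if s.endswith(e) and len(s) > len(e) + 1:
                stems.add(s[:-len(e)])
    return stems


def exact_coverage(tokens: Iterable[str], lexicon: Set[str]) -> float:
    """Fraction of tokens that appear in the lexicon exactly as written."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)


def stripped_coverage(tokens: Iterable[str], lexicon: Set[str]) -> float:
    """Fraction of tokens for which some clitic-stripped form is in the lexicon."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return sum(any(c in lexicon for c in candidates(t)) for t in tokens) / len(tokens)


# Toy data, invented for this example.
toy_lexicon = {"كتاب", "قلم"}
toy_tokens = ["كتاب", "والكتاب", "قلمه", "بيت"]
print(exact_coverage(toy_tokens, toy_lexicon))
print(stripped_coverage(toy_tokens, toy_lexicon))
```

On the toy data, only one token matches the lexicon exactly, while the clitic-stripped lookup also finds the two inflected forms. This mirrors, in miniature, why the second method reports higher coverage than the first.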
Similar resources
Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text
Morphological analyzers and part-of-speech taggers are key technologies for most text analysis applications. Our aim is to develop a part-of-speech tagger for annotating a wide range of Arabic text formats, domains and genres including both vowelized and non-vowelized text. Enriching the text with linguistic analysis will maximize the potential for corpus re-use in a wide range of applications....
Semi-Automatic Data Annotation, POS Tagging and Mildly Context-Sensitive Disambiguation: the eXtended Revised AraMorph (XRAM)
An extended, revised form of Tim Buckwalter’s Arabic lexical and morphological resource AraMorph, eXtended Revised AraMorph (henceforth XRAM), is presented which addresses a number of weaknesses and inconsistencies of the original model by allowing a wider coverage of real-world Classical and contemporary (both formal and informal) Arabic texts. Building upon previous research, XRAM enhancement...
A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer
Current Arabic lexicons, whether computational or otherwise, make no distinction between entries from Modern Standard Arabic (MSA) and Classical Arabic (CA), and tend to include obsolete words that are not attested in current usage. We address this problem by building a large-scale, corpus-based lexical database that is representative of MSA. We use an MSA corpus of 1,089,111,204 words, a pre-a...
IMSLex - Representing Morphological and Syntactic Information in a Relational Database
We present a lexical resource comprising morphological and syntactic information. The resource is realised as a relational database, which facilitates the access and administration of the data. Sophisticated tools have been developed to allow a user-friendly usage of the resource. One application, a broad coverage parser, which makes use of both the morphological and syntactic part of the datab...
Cross-Lingual Induction for Deep Broad-Coverage Syntax: A Case Study on German Participles
This paper is a case study on cross-lingual induction of lexical resources for deep, broad-coverage syntactic analysis of German. We use a parallel corpus to induce a classifier for German participles which can predict their syntactic category. By means of this classifier, we induce a resource of adverbial participles from a huge monolingual corpus of German. We integrate the resource into a Ge...